experimental/ssh: show compute provisioning status during ssh connect startup #5576
… startup
GPU_8xH100 serverless capacity takes ~10 minutes at P50 and ~30 minutes at P90 to acquire, but while waiting `ssh connect` only showed a generic "Starting SSH server... (task: PENDING)" spinner, so users assumed a long wait meant a service outage (see the Zillow report in #remote-development-help).
Show "Waiting for compute to start..." while the bootstrap job's compute spins up (all connection types, including dedicated-cluster auto-start), and print an upfront notice for GPU accelerators that provisioning can take upwards of 10 minutes. The startup timeout increase for GPU accelerators is handled separately.
Co-authored-by: Isaac
Integration test report
Commit: 569e075
Thanks — the diff is clean and the intent is right. Two requested changes on the provisioning notice, both about the wording.
1. Differentiate the message by accelerator type
Right now GPU_1xA10 and GPU_8xH100 get the identical "upwards of 10 minutes" notice, but their provisioning latencies differ a lot — a single A10 is typically acquired much faster than an 8×H100 node. Telling an A10 user to expect 10+ minutes is misleading, and the 8×H100 case arguably warrants a stronger heads-up (P90 ~30 min).
Suggest keying the message off opts.Accelerator — e.g. a small map[string]string of accelerator → expected-time phrasing, with a generic fallback for anything not in the map. That also keeps it correct as new accelerator types are added.
2. Tighten the wording
"upwards of 10 minutes" is a touch informal and slightly misrepresents the data: with P50 ≈ 10 min it implies 10 min is the floor, when in fact roughly half the time it finishes faster — and the real pain is the ~30 min P90 that drove the 45-min timeout in #5569. Anchoring on a range is more useful to someone staring at a long PENDING state. The trailing ... also reads casual for a one-time sentence (vs. the ongoing spinner text, where it fits).
Suggested wording:
- GPU_8xH100:
  `Provisioning GPU_8xH100 compute. This typically takes around 10 minutes and can exceed 30 minutes when capacity is constrained.`
- GPU_1xA10:
  `Provisioning GPU_1xA10 compute. This usually takes a few minutes, longer when capacity is constrained.`
  (adjust to the latency we actually observe)
The matching spinner text can stay short, e.g. Provisioning GPU_8xH100 compute....
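For illustration, the map-with-fallback idea from point 1 could look roughly like the sketch below. This is not the code in the PR: the names `provisioningNotices` and `provisioningNotice`, the fallback wording, the `GPU_NEW` placeholder, and the choice to print nothing when no accelerator is set are all assumptions made for the example. Only the two accelerator-specific messages are taken from the suggested wording above.

```go
package main

import "fmt"

// provisioningNotices maps an accelerator type to its one-time notice.
// The map name is hypothetical; the two messages mirror the suggested wording.
var provisioningNotices = map[string]string{
	"GPU_8xH100": "Provisioning GPU_8xH100 compute. This typically takes around 10 minutes and can exceed 30 minutes when capacity is constrained.",
	"GPU_1xA10":  "Provisioning GPU_1xA10 compute. This usually takes a few minutes, longer when capacity is constrained.",
}

// provisioningNotice returns the notice for the given accelerator.
// Assumption: an empty accelerator means no GPU was requested, so no notice.
func provisioningNotice(accelerator string) string {
	if accelerator == "" {
		return ""
	}
	if msg, ok := provisioningNotices[accelerator]; ok {
		return msg
	}
	// Generic fallback (assumed wording) for accelerator types not in the map.
	return fmt.Sprintf("Provisioning %s compute. This can take several minutes, longer when capacity is constrained.", accelerator)
}

func main() {
	// "GPU_NEW" is a made-up type used only to exercise the fallback.
	for _, acc := range []string{"GPU_8xH100", "GPU_1xA10", "GPU_NEW"} {
		fmt.Println(provisioningNotice(acc))
	}
}
```

Keeping the per-type text in one map means a new accelerator only needs a new entry, and anything unlisted still gets a sensible message.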
The provisioning heads-up for GPU accelerators was identical for every type and said "upwards of 10 minutes", which is misleading: a single GPU_1xA10 is typically acquired in a few minutes, while a GPU_8xH100 node is ~10 min at P50 and can exceed 30 min at P90. Key the notice off the accelerator type via a small map with a generic fallback, and anchor the wording on a range rather than a floor so it stays useful to someone staring at a long PENDING state.
Co-authored-by: Isaac
Changes
While the SSH server bootstrap job's compute spins up, the spinner now reads `Waiting for compute to start...` (all connection types) instead of `Starting SSH server...`. For GPU accelerators, a persistent notice is printed upfront: `Waiting for GPU_8xH100 compute to be provisioned. This can take upwards of 10 minutes depending on capacity...`.
Why
`ssh connect --accelerator=GPU_8xH100` frequently fails with a startup timeout error. GPU_8xH100 launch latency is ~10 minutes at P50 and ~30 minutes at P90, so sessions routinely hit the startup timeout even when nothing is wrong. Nothing in the output indicated that compute was being provisioned, so users read the error as a service outage.
Tests
`go build`, `go vet`, and `go test ./experimental/ssh/...` all pass; `TestWaitForJobToStartSurfacesFailure` updated for the `waitForJobToStart` signature change.
This pull request and its description were written by Isaac.